ICface: Interpretable and Controllable Face Reenactment Using GANs
This paper presents a generic face animator that is able to control the pose
and expressions of a given face image. The animation is driven by
human-interpretable control signals consisting of head pose angles and Action
Unit (AU) values. The control information can be obtained from multiple sources
including external driving videos and manual controls. Due to the interpretable
nature of the driving signal, one can easily mix the information between
multiple sources (e.g. pose from one image and expression from another) and
apply selective post-production editing. The proposed face animator is
implemented as a two-stage neural network model that is learned in a
self-supervised manner using a large video collection. The proposed
Interpretable and Controllable face reenactment network (ICface) is compared to
the state-of-the-art neural network-based face animation techniques in multiple
tasks. The results indicate that ICface produces better visual quality while
being more versatile than most of the comparison methods. The introduced model
could provide a lightweight and easy-to-use tool for a multitude of advanced
image and video editing tasks.
Comment: Accepted in WACV-2020
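To make the signal mixing concrete, the following is a minimal PyTorch sketch, not the ICface implementation: it assumes a 3-angle head pose plus 17 AU intensities and uses a toy conditional module in place of the paper's two-stage generator.

```python
# Minimal sketch (assumptions, not the ICface code) of mixing interpretable
# driving signals: pose from one donor, Action Unit values from another.
import torch
import torch.nn as nn

HEAD_POSE_DIMS = 3       # yaw, pitch, roll (assumed ordering)
NUM_ACTION_UNITS = 17    # number of AU intensities (assumption)

def mix_driving_signals(pose_source: torch.Tensor, expr_source: torch.Tensor) -> torch.Tensor:
    """Take head pose angles from one driving signal and AU values from another."""
    pose = pose_source[:, :HEAD_POSE_DIMS]
    aus = expr_source[:, HEAD_POSE_DIMS:]
    return torch.cat([pose, aus], dim=1)

class ToyAnimator(nn.Module):
    """Toy stand-in for the paper's generator: conditions an image on a control vector."""
    def __init__(self, ctrl_dim: int):
        super().__init__()
        self.proj = nn.Linear(ctrl_dim, 3)

    def forward(self, img: torch.Tensor, ctrl: torch.Tensor) -> torch.Tensor:
        # Broadcast a per-channel shift derived from the control vector.
        return img + self.proj(ctrl).view(-1, 3, 1, 1)

ctrl_dim = HEAD_POSE_DIMS + NUM_ACTION_UNITS
source_face = torch.randn(1, 3, 128, 128)   # source identity image
signal_a = torch.rand(1, ctrl_dim)          # pose donor
signal_b = torch.rand(1, ctrl_dim)          # expression donor

driving = mix_driving_signals(signal_a, signal_b)
reenacted = ToyAnimator(ctrl_dim)(source_face, driving)
print(driving.shape, reenacted.shape)       # torch.Size([1, 20]) torch.Size([1, 3, 128, 128])
```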
Multi-modal Dense Video Captioning
Dense video captioning is a task of localizing interesting events from an
untrimmed video and producing a textual description (caption) for each localized
event. Most of the previous works in dense video captioning are solely based on
visual information and completely ignore the audio track. However, audio, and
speech, in particular, are vital cues for a human observer in understanding an
environment. In this paper, we present a new dense video captioning approach
that is able to utilize any number of modalities for event description.
Specifically, we show how audio and speech modalities may improve a dense video
captioning model. We apply an automatic speech recognition (ASR) system to obtain
a temporally aligned textual description of the speech (similar to subtitles)
and treat it as a separate input alongside video frames and the corresponding
audio track. We formulate the captioning task as a machine translation problem
and utilize the recently proposed Transformer architecture to convert multi-modal
input data into textual descriptions. We demonstrate the performance of our
model on the ActivityNet Captions dataset. The ablation studies indicate a
considerable contribution from audio and speech components suggesting that
these modalities contain substantial complementary information to video frames.
Furthermore, we provide an in-depth analysis of the ActivityNet Captions results
by leveraging the category tags obtained from the original YouTube videos. Code is
publicly available: github.com/v-iashin/MDVC
Comment: To appear in the proceedings of CVPR Workshops 2020; Code: https://github.com/v-iashin/MDVC; Project Page: https://v-iashin.github.io/mdv
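A minimal sketch of the translation-style formulation follows. It is not the MDVC implementation: feature dimensions, layer sizes, and the single joint encoder are assumptions, and the paper's actual fusion of video, audio, and ASR streams differs; this only illustrates multi-modal conditioning of a Transformer captioner.

```python
# Sketch only: each modality is projected to a shared width, encoded jointly,
# and a caption decoder attends to the resulting memory.
import torch
import torch.nn as nn

D_MODEL, VOCAB = 256, 10000

class MultiModalCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.video_proj = nn.Linear(1024, D_MODEL)       # visual features (assumed dim)
        self.audio_proj = nn.Linear(128, D_MODEL)        # audio embeddings (assumed dim)
        self.text_embed = nn.Embedding(VOCAB, D_MODEL)   # shared for ASR and caption tokens
        enc_layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, video, audio, asr_tokens, caption_tokens):
        # Concatenate the per-modality sequences along time to form one memory.
        memory = torch.cat([self.video_proj(video),
                            self.audio_proj(audio),
                            self.text_embed(asr_tokens)], dim=1)
        memory = self.encoder(memory)
        tgt = self.text_embed(caption_tokens)
        # Positional encodings and the causal target mask are omitted for brevity.
        return self.out(self.decoder(tgt, memory))

model = MultiModalCaptioner()
logits = model(torch.randn(2, 20, 1024), torch.randn(2, 30, 128),
               torch.randint(0, VOCAB, (2, 15)), torch.randint(0, VOCAB, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```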
Visually Guided Sound Source Separation using Cascaded Opponent Filter Network
The objective of this paper is to recover the original component signals from
an audio mixture with the aid of visual cues of the sound sources. This task is
usually referred to as visually guided sound source separation. The proposed
Cascaded Opponent Filter (COF) framework consists of multiple stages, which
recursively refine the source separation. A key element in COF is a novel
opponent filter module that identifies and relocates residual components
between sources. The system is guided by the appearance and motion of the
source, and, for this purpose, we study different representations based on
video frames, optical flows, dynamic images, and their combinations. Finally,
we propose a Sound Source Location Masking (SSLM) technique, which, together
with COF, produces a pixel-level mask of the source location. The entire system
is trained end-to-end using a large set of unlabelled videos. We compare COF
with recent baselines and obtain state-of-the-art performance on three
challenging datasets (MUSIC, A-MUSIC, and A-NATURAL). Project page:
https://ly-zhu.github.io/cof-net
Comment: main paper 14 pages, ref 3 pages, and supp 7 pages. Revised argument in section 3 and
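Below is a toy sketch of the cascaded, visually guided refinement idea for two sources. The stage architecture, the opponent exchange rule, and the fraction `alpha` are assumptions for illustration, not the COF design.

```python
# Sketch only: each stage predicts a mask from the current estimate and a visual
# cue, and a toy "opponent" step shifts residual energy between the two sources.
import torch
import torch.nn as nn

class Stage(nn.Module):
    """Predicts a separation mask from the current estimate and a visual embedding."""
    def __init__(self, freq_bins=256, vis_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, freq_bins)
        self.refine = nn.Conv1d(freq_bins, freq_bins, kernel_size=3, padding=1)

    def forward(self, spec, vis_feat):
        # spec: (B, F, T) magnitude spectrogram; vis_feat: (B, vis_dim) appearance/motion cue.
        cue = self.vis_proj(vis_feat).unsqueeze(-1)          # (B, F, 1)
        return torch.sigmoid(self.refine(spec) + cue)        # mask in [0, 1]

def opponent_exchange(est_a, est_b, alpha=0.1):
    """Toy opponent step: shift a fraction of the residual between the two estimates."""
    residual = est_a - est_b
    return est_a - alpha * residual, est_b + alpha * residual

stages = nn.ModuleList([Stage(), Stage()])                   # two refinement stages
mix = torch.randn(1, 256, 100).abs()                         # mixture spectrogram
vis_a, vis_b = torch.randn(1, 512), torch.randn(1, 512)      # visual cues per source

est_a, est_b = mix, mix
for stage in stages:                                         # recursive refinement
    est_a = stage(est_a, vis_a) * mix
    est_b = stage(est_b, vis_b) * mix
    est_a, est_b = opponent_exchange(est_a, est_b)
print(est_a.shape, est_b.shape)                              # (1, 256, 100) each
```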
Digging Deeper into Egocentric Gaze Prediction
This paper digs deeper into factors that influence egocentric gaze. Instead
of training deep models for this purpose in a blind manner, we propose to
inspect factors that contribute to gaze guidance during daily tasks. Bottom-up
saliency and optical flow are assessed versus strong spatial prior baselines.
Task-specific cues such as vanishing point, manipulation point, and hand
regions are analyzed as representatives of top-down information. We also look
into the contribution of these factors by investigating a simple recurrent
neural model for egocentric gaze prediction. First, deep features are
extracted for all input video frames. Then, a gated recurrent unit is employed
to integrate information over time and to predict the next fixation. We also
propose an integrated model that combines the recurrent model with several
top-down and bottom-up cues. Extensive experiments over multiple datasets
reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up
saliency models perform poorly in predicting gaze and underperform spatial
biases, (3) deep features perform better compared to traditional features, (4)
as opposed to hand regions, the manipulation point is a strong influential cue
for gaze prediction, (5) combining the proposed recurrent model with bottom-up
cues, vanishing points, and, in particular, the manipulation point results in the
best gaze prediction accuracy over egocentric videos, (6) the knowledge
transfer works best for cases where the tasks or sequences are similar, and (7)
task and activity recognition can benefit from gaze prediction. Our findings
suggest that (1) there should be more emphasis on hand-object interaction and
(2) the egocentric vision community should consider larger datasets including
diverse stimuli and more subjects.
Comment: presented at WACV 2019
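A minimal sketch of the recurrent idea follows, assuming pooled per-frame CNN features and a simple (x, y) output; it is not the paper's model, and top-down cues such as a manipulation-point map would be extra inputs in a fuller version.

```python
# Sketch only: per-frame deep features are integrated with a GRU and the final
# hidden state is mapped to the next fixation location.
import torch
import torch.nn as nn

class GazeGRU(nn.Module):
    def __init__(self, feat_dim=2048, hidden=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fixation = nn.Linear(hidden, 2)      # (x, y) in normalized image coordinates

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) pooled CNN features for T past frames (assumption)
        out, _ = self.gru(frame_feats)
        return self.fixation(out[:, -1])          # predicted next fixation

model = GazeGRU()
feats = torch.randn(4, 16, 2048)                  # 16 frames of deep features
print(model(feats).shape)                         # torch.Size([4, 2])
```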